Abstract

Existing MAP inference algorithms for determinantal point processes
(DPPs) need to calculate determinants or conduct eigenvalue
decomposition, generally at the scale of the full kernel, which
presents a great challenge for real-world applications. In this paper,
we introduce a class of DPPs, called BwDPPs, that are characterized by
an almost block diagonal kernel matrix and thus allow efficient
block-wise MAP inference. Furthermore, BwDPPs are applied to address
the difficulty of selecting change-points in the problem of
change-point detection (CPD), which results in a new BwDPP-based CPD
method, named BwDppCpd. In BwDppCpd, a preliminary set of change-point
candidates is first created based on existing well-studied metrics.
Then, these change-point candidates are treated as DPP items, and
DPP-based subset selection is conducted to give the final estimate of
the change-points that favours both quality and diversity. The
effectiveness of BwDppCpd is demonstrated through extensive
experiments on five real-world datasets.


Introduction 

The determinantal point processes (DPPs) are elegant probabilistic
models for subset selection problems where both quality and diversity
are considered. Formally, given a set of items
$\mathcal{Y} = \{1, \dots, N\}$, a DPP defines a probability measure
$\mathcal{P}_L$ on $2^{\mathcal{Y}}$, the set of all subsets of
$\mathcal{Y}$. For every subset $Y \subseteq \mathcal{Y}$ we have

$$\mathcal{P}_L(Y) \propto \det(L_Y), \tag{1}$$

where the L-ensemble kernel $L$ is an $N \times N$ positive
semi-definite matrix. By writing $L = B^T B$ as a Gram matrix,
$\det(L_Y)$ can be viewed as the squared volume spanned by the column
vectors $B_i$ for $i \in Y$. By defining $B_i = q_i \phi_i$, a popular
decomposition of the kernel is given as

$$L_{ij} = q_i \phi_i^T \phi_j q_j, \tag{2}$$

where $q_i \in \mathbb{R}^+$ measures the quality (magnitude) of item
$i$ in $\mathcal{Y}$, and $\phi_i \in \mathbb{R}^k$, $\|\phi_i\| = 1$,
can be viewed as the angle vector of diversity features, so that
$\phi_i^T \phi_j$ measures the similarity between items $i$ and $j$.
It can be shown that the probability of including $i$ and $j$
increases with the quality of $i$ and $j$ and with the diversity
between $i$ and $j$. As a result, a DPP assigns high probability to
subsets that are both of good quality and diverse (Kulesza and Taskar
2012).



Figure 1: (a) A 10-sec part of a 2-min speech recording, shown with
change-point candidates. Segments of different speakers or noises are
plotted in different colors. (b) BwDPP kernel constructed for the whole
2-min recording, with the 112 change-point candidates as BwDPP items.
White denotes non-zero entries, while black indicates zero.


For DPPs, the maximum a posteriori (MAP) problem
$\arg\max_{Y \subseteq \mathcal{Y}} \det(L_Y)$, which aims at finding
the subset with the highest probability, has attracted much attention
due to its broad range of potential applications. Since this is an
NP-hard problem (Ko, Lee, and Queyranne 1995), a number of approximate
inference methods have been proposed, including greedy methods for
optimizing the submodular function $\log\det(L_Y)$ (Buchbinder et al.
2012; Nemhauser, Wolsey, and Fisher 1978), optimization via continuous
relaxation (Gillenwater, Kulesza, and Taskar 2012), and minimum Bayes
risk decoding that minimizes an application-specific loss function
(Kulesza and Taskar 2012).


These existing methods need to calculate determinants or conduct
eigenvalue decomposition. Both computations are carried out at the
scale of the kernel size $N$, at a cost of around $O(N^3)$ time, which
becomes intolerably high when $N$ becomes large, e.g. in the
thousands. Nevertheless, we find that for a class of DPPs where the
kernel is almost block diagonal (Fig. 1(b)), the MAP inference with
the whole kernel can be replaced by a series of sub-inferences with its
sub-kernels. Since the sub-kernels are smaller, the overall
computational cost can be significantly reduced. Such DPPs are often
defined over a line, where items are similar only to their neighbours
on the line and significantly different from those far away. Since the
MAP inference for such DPPs is conducted in a block-wise manner, we
refer to them as BwDPPs (block-wise DPPs) in the rest of the paper.


The above observation is mainly motivated by the problem of
change-point detection (CPD), which aims at detecting abrupt changes in
time-series data (Gustafsson and Gustafsson 2000). In CPD, the period
of time between two consecutive change-points, often referred to as a
segment or a state, has homogeneous properties of interest (e.g. the
same speaker in a speech recording (Chen and Gopalakrishnan 1998) or
the same behaviour in human activity data (Liu et al. 2013)). After
choosing a number of change-point candidates without much difficulty,
we can treat these change-point candidates as DPP items, and select a
subset from them to be our final estimate of the change-points. Each
change-point candidate has its own quality of being a change-point.
Moreover, the true locations of change-points along the timeline tend
to be diverse, since states (e.g. speakers in Fig. 1(a)) would not
change rapidly. Therefore, it is preferable to conduct change-point
selection that incorporates both quality and diversity. DPP-based
subset selection clearly suits this purpose well. Meanwhile, the
corresponding kernel becomes almost block diagonal (e.g. Fig. 1(b)),
as neighbouring items are less diversified and items far apart are
more diversified. In this case, the DPP becomes a BwDPP.

The problem of CPD has been actively studied for decades, and the
various CPD methods can be broadly classified into Bayesian and
frequentist approaches. In the Bayesian approach, the CPD problem is
reduced to estimating the posterior distribution of the change-point
locations given the time-series data (Green 1995). Other posteriors to
be estimated include the 0/1 indicator sequence (Lavielle and Lebarbier
2001) and the "run length" (Adams and MacKay 2007). Although many
improvements have been made, e.g. using advanced Monte Carlo methods,
the efficiency of estimating these posteriors is still a big challenge
for real-world tasks.

In the frequentist approach, the core idea is hypothesis testing, and
the general strategy is to first define a metric (test statistic) by
considering the observations over past and present windows. As both
windows move forward, change-points are selected when the metric value
exceeds a threshold. Some widely-used metrics include the cumulative
sum (Basseville, Nikiforov, and others 1993), the generalized
likelihood-ratio (Gustafsson 1996), the Bayesian information criterion
(BIC) (Chen and Gopalakrishnan 1998), the Kullback-Leibler divergence
(Delacourt and Wellekens 2000), and more recently, subspace-based
metrics (Ide and Tsuda 2007; Kawahara, Yairi, and Machida 2007),
kernel-based metrics (Desobry, Davy, and Doncarli 2005), and
density-ratio (Kanamori, Suzuki, and Sugiyama 2010; Kawahara and
Sugiyama 2012). While various metrics have been explored, how to
choose thresholds and perform change-point selection, which is also a
determining factor for detection performance, is relatively less
studied. Heuristic rules or procedures are dominant and do not perform
well, e.g. selecting local peaks above a threshold (Kawahara, Yairi,
and Machida 2007), discarding the lower one if two peaks are close
(Liu et al. 2013), or requiring the metric differences between
change-points and their neighbouring valleys to be above a threshold
(Delacourt and Wellekens 2000).

In this paper, we propose to apply DPPs to address the difficulty of
selecting change-points. Based on existing well-studied metrics, we can
create a preliminary set of change-point candidates without much
difficulty. Then, we treat these change-point candidates as DPP items,
and conduct DPP-based subset selection to obtain the final estimate of
the change-points that favours both quality and diversity.

The contribution of this paper is two-fold. First, we introduce a
class of DPPs, called BwDPPs, that are characterized by an almost block
diagonal kernel matrix and thus allow efficient block-wise MAP
inference. Second, BwDPPs are applied to address the difficult problem
of selecting change-points, which results in a new BwDPP-based CPD
method, named BwDppCpd.

The rest of the paper is organized as follows. After describing brief
preliminaries, we introduce BwDPPs and give our theoretical result on
the BwDPP-MAP method. Next, we introduce BwDppCpd and present
evaluation results on a number of real-world datasets. Finally, we
conclude the paper with a discussion of potential future directions.


Preliminaries 


Throughout the paper, we are interested in MAP inference for BwDPPs, a
particular class of DPPs where the L-ensemble kernel $L$ is almost
block diagonal¹, namely

$$
L = \begin{bmatrix}
L_1 & A_1 & & & \mathbf{0} \\
A_1^T & L_2 & A_2 & & \\
& A_2^T & \ddots & \ddots & \\
& & \ddots & L_{m-1} & A_{m-1} \\
\mathbf{0} & & & A_{m-1}^T & L_m
\end{bmatrix}, \tag{3}
$$

where the diagonal sub-matrices $L_i$ are sub-kernels containing DPP
items that are mutually similar, and the off-diagonal sub-matrices
$A_i$ are sparse sub-matrices with non-zero entries only at the bottom
left, representing the connections between adjacent sub-kernels.
Fig. 2(a) gives a good example of such matrices.

Let $\mathcal{Y}$ be the set of all indices of $L$ and let
$\mathcal{Y}_1, \dots, \mathcal{Y}_m$ be those of $L_1, \dots, L_m$
correspondingly. For any sets of indices
$C_i, C_j \subseteq \mathcal{Y}$, we use $L_{C_i}$ to denote the square
sub-matrix indexed by $C_i$, and $L_{C_i, C_j}$ the
$|C_i| \times |C_j|$ sub-matrix with rows indexed by $C_i$ and columns
by $C_j$. Following general notation, by
$L = \mathrm{diag}(L_1, \dots, L_m)$ we mean the block diagonal matrix
$L$ consisting of sub-matrices $L_1, \dots, L_m$, and $L \succeq 0$
means that $L$ is positive semi-definite.


MAP Inference for BwDPPs 

Strictly Block Diagonal Kernel 

We first consider the motivating case where the kernel is strictly
block diagonal, i.e. all elements in the off-diagonal sub-matrices
$A_i$ are zero. It can be easily seen that the following
divide-and-conquer theorem holds.

Theorem 1 For the DPP with a block diagonal kernel
$L = \mathrm{diag}(L_1, \dots, L_m)$ over ground set
$\mathcal{Y} = \bigcup_{i=1}^{m} \mathcal{Y}_i$, which is partitioned
correspondingly, the MAP solution can be obtained as:

$$\hat{C} = \hat{C}_1 \cup \dots \cup \hat{C}_m, \tag{4}$$

where $\hat{C} = \arg\max_{C \subseteq \mathcal{Y}} \det(L_C)$ and
$\hat{C}_i = \arg\max_{C_i \subseteq \mathcal{Y}_i} \det(L_{C_i})$.

¹ Such matrices could also be defined as a particular class of block
tridiagonal matrices, where the off-diagonal sub-matrices $A_i$ only
have a few non-zero entries at the bottom left.

Theorem 1 tells us that the MAP inference with a strictly block
diagonal kernel can be decomposed into a series of sub-inferences with
its sub-kernels. In this way, the overall computation cost can be
largely reduced. Since no exact DPP-MAP algorithm is available so far,
any approximate DPP-MAP algorithm can be used in a plug-and-play way
for the sub-inferences.
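As a small numerical illustration of Theorem 1, the following sketch builds a toy strictly block diagonal kernel and checks by exhaustive search that the MAP subset of the whole kernel is the union of the per-block MAP subsets. The block sizes, the random seed and the helper name `brute_force_map` are assumptions made only for this example; exhaustive search is feasible only for very small kernels.

```python
import itertools
import numpy as np


def brute_force_map(L):
    """Exhaustively search for the subset C maximising det(L_C)."""
    n = L.shape[0]
    best_set, best_det = (), 1.0  # determinant of the empty sub-matrix is 1
    for r in range(1, n + 1):
        for C in itertools.combinations(range(n), r):
            d = np.linalg.det(L[np.ix_(C, C)])
            if d > best_det:
                best_set, best_det = C, d
    return set(best_set), best_det


rng = np.random.default_rng(0)
sizes = [3, 4, 3]                       # toy block sizes (hypothetical)
blocks = []
for s in sizes:
    B = rng.normal(size=(s, s))
    blocks.append(B.T @ B)              # each block is a PSD Gram matrix

n = sum(sizes)
offsets = [0, 3, 7]                     # start index of each block
L = np.zeros((n, n))
for blk, lo in zip(blocks, offsets):
    L[lo:lo + blk.shape[0], lo:lo + blk.shape[0]] = blk

# MAP on the full kernel versus union of the per-block MAP solutions.
full_set, full_det = brute_force_map(L)
union_set, prod_det = set(), 1.0
for blk, lo in zip(blocks, offsets):
    sub_set, sub_det = brute_force_map(blk)
    union_set |= {lo + j for j in sub_set}
    prod_det *= sub_det
```

The full search visits 2^10 subsets, whereas the block-wise search visits only 2^3 + 2^4 + 2^3, which is the source of the cost reduction.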


Almost Block Diagonal Kernel 


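Now we analyze the MAP inference for a BwDPP with an almost block
diagonal kernel as defined in (3). Let $C \subseteq \mathcal{Y}$ be the
hypothesized subset to be selected from $L$ and let
$C_1 \subseteq \mathcal{Y}_1, \dots, C_m \subseteq \mathcal{Y}_m$ be
those from $L_1, \dots, L_m$ correspondingly, where
$C_i = C \cap \mathcal{Y}_i$. Without loss of generality, we assume
$L_{C_i}$ is invertible² for $i = 1, \dots, m$. By defining
$\tilde{L}_{C_i}$ recursively as

$$
\tilde{L}_{C_i} =
\begin{cases}
L_{C_i}, & i = 1, \\
L_{C_i} - L_{C_{i-1}, C_i}^T \tilde{L}_{C_{i-1}}^{-1} L_{C_{i-1}, C_i}, & i = 2, \dots, m,
\end{cases}
\tag{5}
$$

one can rewrite the MAP objective function:

$$
\begin{aligned}
\det(L_C) &= \det(L_{C_1}) \det\left( L_{\cup_{i=2}^{m} C_i}
  - L_{C_1, \cup_{i=2}^{m} C_i}^T L_{C_1}^{-1} L_{C_1, \cup_{i=2}^{m} C_i} \right) \\
&= \det(\tilde{L}_{C_1}) \det\left(
\begin{bmatrix}
\tilde{L}_{C_2} & [L_{C_2, C_3} \;\; \mathbf{0}] \\
[L_{C_2, C_3} \;\; \mathbf{0}]^T & L_{\cup_{i=3}^{m} C_i}
\end{bmatrix}
\right),
\end{aligned}
\tag{6}
$$

where $\mathbf{0}$ represents a zero matrix of appropriate size that
fills the corresponding area with zeros. The key to the second
equality above is that $L_{C_1, C_i} = \mathbf{0}$ for $i \geq 3$,
since $L$ is an almost block diagonal kernel. Continuing this
recursion,

$$\det(L_C) = \dots = \prod_{i=1}^{m} \det(\tilde{L}_{C_i}). \tag{7}$$

Hence, the MAP objective reduces to:

$$
\arg\max_{C \subseteq \mathcal{Y}} \det(L_C)
= \arg\max_{C_1 \subseteq \mathcal{Y}_1, \dots, C_m \subseteq \mathcal{Y}_m}
\prod_{i=1}^{m} \det(\tilde{L}_{C_i}). \tag{8}
$$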


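As $\tilde{L}_{C_i}$ depends on $C_1, \dots, C_i$, we cannot optimize
$\det(\tilde{L}_{C_1}), \dots, \det(\tilde{L}_{C_m})$ separately.
Alternatively, we provide an approximate method that optimizes over
$C_1, \dots, C_m$ sequentially, named the BwDPP-MAP method, which is a
depth-first greedy search method in essence. The BwDPP-MAP method is
described in Table 1, where
$\arg\max_{C_i;\, C_j = \hat{C}_j,\, j = 1, \dots, i-1}$ denotes
optimizing over $C_i$ with the value of $C_j$ fixed as $\hat{C}_j$ for
$j = 1, \dots, i-1$, and the sub-kernel³ $\tilde{L}_{\mathcal{Y}_i}$ is
given similarly to $\tilde{L}_{C_i}$, namely

$$
\tilde{L}_{\mathcal{Y}_i} =
\begin{cases}
L_{\mathcal{Y}_i}, & i = 1, \\
L_{\mathcal{Y}_i} - L_{C_{i-1}, \mathcal{Y}_i}^T \tilde{L}_{C_{i-1}}^{-1} L_{C_{i-1}, \mathcal{Y}_i}, & i = 2, \dots, m.
\end{cases}
\tag{9}
$$

One may notice that $(\tilde{L}_{\mathcal{Y}_i})_{C_i}$ is equivalent
to $\tilde{L}_{C_i}$.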


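Table 1: BwDPP-MAP Algorithm

Input: $L$ as defined in (3);
Output: Subset of items $\hat{C}$.
For $i = 1, \dots, m$:
    Compute $\tilde{L}_{\mathcal{Y}_i}$ via (9);
    Perform sub-inference over $C_i$ via
    $\hat{C}_i = \arg\max_{C_i \subseteq \mathcal{Y}_i;\, C_j = \hat{C}_j,\, j = 1, \dots, i-1} \det((\tilde{L}_{\mathcal{Y}_i})_{C_i})$;
Return: $\hat{C} = \bigcup_{i=1}^{m} \hat{C}_i$.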
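The sequential procedure of Table 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the sub-inference is done by exhaustive enumeration (any DPP-MAP routine could be plugged in instead), the sub-kernel update is the Schur-complement form of (9), and the toy banded kernel, block layout and function names are hypothetical.

```python
import itertools
import numpy as np


def exact_sub_map(K):
    """Exact MAP on a small kernel K by enumeration (stand-in sub-inference)."""
    n = K.shape[0]
    best, best_det = [], 1.0            # the empty set has determinant 1
    for r in range(1, n + 1):
        for C in itertools.combinations(range(n), r):
            d = np.linalg.det(K[np.ix_(C, C)])
            if d > best_det:
                best, best_det = list(C), d
    return best


def bwdpp_map(L, blocks, sub_map=exact_sub_map):
    """Block-wise MAP: returns selected global indices and the factors of (7)."""
    selected, factors = [], []
    prev_idx, prev_tilde = [], None     # previous selection and its tilde-kernel
    for Y in blocks:
        K = L[np.ix_(Y, Y)].copy()      # sub-kernel of the current block
        if prev_idx:
            cross = L[np.ix_(prev_idx, Y)]
            # Schur-complement correction by the previous selection, as in (9).
            K -= cross.T @ np.linalg.solve(prev_tilde, cross)
        local = sub_map(K)
        prev_idx = [Y[j] for j in local]
        prev_tilde = K[np.ix_(local, local)]
        factors.append(np.linalg.det(prev_tilde) if local else 1.0)
        selected.extend(prev_idx)
    return selected, factors


# Toy almost block diagonal kernel: a banded Gram matrix (bandwidth k).
rng = np.random.default_rng(1)
n, k = 12, 2
B = np.zeros((n + k, n))
for i in range(n):
    B[i:i + k + 1, i] = rng.normal(size=k + 1)
L = B.T @ B
blocks = [list(range(s, s + 4)) for s in range(0, n, 4)]

C_hat, factors = bwdpp_map(L, blocks)
det_bw = np.linalg.det(L[np.ix_(C_hat, C_hat)]) if C_hat else 1.0

# Exact MAP on the whole kernel, for comparison (feasible only because n = 12).
C_full = exact_sub_map(L)
det_full = np.linalg.det(L[np.ix_(C_full, C_full)]) if C_full else 1.0
```

In this sketch the product of the per-block factors reproduces the determinant of the selected sub-matrix, as in (7), while the block-wise result can be no better than the exact maximum over the whole kernel.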

In conclusion, similar to the MAP inference with a strictly block
diagonal kernel, by using BwDPP-MAP, the MAP inference for an almost
block diagonal kernel can also be decomposed into a series of
sub-inferences for the sub-kernels. There are four comments on this
conclusion.

First, it should be noted that the above BwDPP-MAP method is an
approximate optimization method, even if each sub-inference step is
conducted exactly. This is because $\tilde{L}_{C_i}$ depends on
$C_1, \dots, C_i$. We provide an empirical evaluation later, showing
that through block-wise operation, the greedy search in BwDPP-MAP can
achieve a computational speed-up with marginal sacrifice of accuracy.

Second, by the following Lemma 1, we show that each sub-kernel
$\tilde{L}_{\mathcal{Y}_i}$ is positive semi-definite, so that it is
theoretically guaranteed that we can conduct each sub-inference via
existing DPP-MAP algorithms, e.g. the greedy DPP-MAP algorithm
(Table 2) (Gillenwater, Kulesza, and Taskar 2012). The proof of
Lemma 1 can be found in the appendix.

Lemma 1 $\tilde{L}_{\mathcal{Y}_i} \succeq 0$, for $i = 1, \dots, m$.

Third, in order to apply BwDPP-MAP, we need to first partition a given
DPP kernel into the form of an almost block diagonal matrix as defined
in (3). The partition is not unique. A trivial partition for an
arbitrary DPP kernel is no partition, i.e., regarding the whole matrix
as a single block. We leave the study of finding the optimal partition
for further work. Here we provide a heuristic rule for partition,
which is called γ-partition and performs well in our experiments.

Definition 1 (γ-partition) A γ-partition is defined by partitioning a
DPP kernel $L$ into the almost block diagonal form as defined in (3)
with the maximum number of blocks (i.e. the largest possible $m$)⁴,
where for every off-diagonal matrix $A_i$, the non-zero area is only at
the bottom left and its size does not exceed γ × γ.

A heuristic way to obtain a γ-partition for a kernel $L$ is to first
identify as many non-overlapping dense square sub-matrices along the
main diagonal as possible. Next, two adjacent square sub-matrices on
the main diagonal are merged if the size of the non-zero area in their
corresponding off-diagonal sub-matrix exceeds γ × γ.
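The merging step of this heuristic can be sketched as follows. Several choices here are assumptions made for illustration only: the initial dense diagonal blocks are taken as given (through their start indices), the "size of the non-zero area" of an off-diagonal block is measured as the side of the smallest bottom-left square containing all its non-zeros, and the names `nonzero_extent` and `gamma_partition` are hypothetical.

```python
import numpy as np


def nonzero_extent(A):
    """Side of the smallest bottom-left square of A containing all non-zeros."""
    rows, cols = np.nonzero(A)
    if rows.size == 0:
        return 0
    height = A.shape[0] - rows.min()    # rows counted up from the bottom edge
    width = cols.max() + 1              # columns counted from the left edge
    return int(max(height, width))


def gamma_partition(L, boundaries, gamma):
    """Merge adjacent initial blocks whose coupling area exceeds gamma x gamma.

    boundaries: sorted start indices of the initial diagonal blocks,
    e.g. [0, 4, 8] for three blocks in a 12 x 12 kernel (assumed given).
    """
    n = L.shape[0]
    starts = list(boundaries) + [n]
    merged = [starts[0]]
    for b in range(1, len(starts) - 1):
        lo, mid, hi = merged[-1], starts[b], starts[b + 1]
        A = L[lo:mid, mid:hi]           # coupling between the two neighbours
        if nonzero_extent(A) > gamma:
            continue                    # merge: drop the boundary at `mid`
        merged.append(mid)
    ends = merged[1:] + [n]
    return [list(range(s, e)) for s, e in zip(merged, ends)]


# Toy banded kernel (bandwidth 2) with initial blocks of size 4.
rng = np.random.default_rng(0)
n, k = 12, 2
B = np.zeros((n + k, n))
for i in range(n):
    B[i:i + k + 1, i] = rng.normal(size=k + 1)
L = B.T @ B

fine = gamma_partition(L, [0, 4, 8], gamma=2)    # couplings fit in 2 x 2
coarse = gamma_partition(L, [0, 4, 8], gamma=1)  # couplings too large: merge
```

With this toy kernel, γ = 2 keeps the three initial blocks, whereas γ = 1 forces all of them to merge into a single block.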

It should be noted that a kernel could be subject to γ-partition in
one or more ways with different values of γ. By taking γ-partitions of
a kernel with different values of γ, we can obtain a balance between
computation cost and optimization accuracy. A smaller γ implies a
smaller $m$ achievable in the γ-partition, and thus a smaller
computation reduction. On the other hand, a smaller γ means a smaller
degree of interaction between adjacent sub-inferences, and thus better
optimization accuracy.

² This simply assumes that we only consider the non-trivial subsets
selected with a DPP kernel $L$, i.e. $\det(L_{C_i}) > 0$.

³ Both $L_{\mathcal{Y}_i}$ and $\tilde{L}_{\mathcal{Y}_i}$ are called
sub-kernels.

⁴ Generally speaking, a partition of a kernel of size $N$ into $m$
sub-kernels will approximately reduce the computational complexity by
a factor of $m^2$. A larger $m$ implies a larger computation reduction.

Fourth, an empirical illustration of BwDPP-MAP is given in Fig. 2,
where the greedy MAP algorithm (Table 2) (Gillenwater, Kulesza, and
Taskar 2012) is used for the sub-inferences in BwDPP-MAP. The synthetic
kernel size is fixed at 500. For each realization, the area of non-zero
entries in the kernel is first specified by choosing, uniformly at
random, the size of the sub-kernels from [10, 30] and the size of the
non-zero areas in the off-diagonal sub-matrices from {0, 2, 4, 6}.
Next, a vector $B_i$ is generated for each item $i$ separately,
following the standard normal distribution. Finally, for all non-zero
entries ($L_{ij} \neq 0$) specified in the previous step, the entry
value is given by $L_{ij} = B_i^T B_j$. Fig. 2(a) provides an example
of such a synthetic kernel.

We generate 1000 synthetic kernels as described above. For each
synthetic kernel, we take γ-partitions with γ = 0, 2, 4, 6, and then
run BwDPP-MAP. The performance of directly applying the greedy MAP
algorithm on the original unpartitioned kernel is used as the baseline.
The results in Fig. 2(b) show that BwDPP-MAP runs much faster than the
baseline. As γ increases, the runtime drops while the inference
accuracy degrades within a tolerable range.

Connection between BwDPP-MAP and its 
Sub-inference Algorithm 

Any DPP-MAP inference algorithm can be used in a plug-and-play fashion
for the sub-inference procedure of BwDPP-MAP. It is natural to ask
about the connection between BwDPP-MAP and its corresponding DPP-MAP
algorithm. The relation is given by the following result.

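Theorem 2 Let $f$ be any DPP-MAP algorithm for BwDPP-MAP
sub-inference, where $f$ maps a positive semi-definite matrix to a
subset of its indices, i.e. $f: L \in \mathbb{S}_+ \mapsto Y \subseteq \mathcal{Y}$.
BwDPP-MAP (Table 1) is equivalent to applying the following steps
successively to the almost block diagonal kernel as defined in (3):

$$\hat{C}_1 = f(L_{\mathcal{Y}_1}), \tag{10}$$

and for $i = 2, \dots, m$,

$$
\hat{C}_i = f\left( L_{\cup_{j=1}^{i} \mathcal{Y}_j} \,\middle|\,
\hat{C}_{1:i-1} \subseteq Y,\; \bar{C}_{1:i-1} \cap Y = \emptyset \right), \tag{11}
$$

where $\hat{C}_{1:i-1} = \bigcup_{j=1}^{i-1} \hat{C}_j$,
$\bar{C}_{1:i-1} = \bigcup_{j=1}^{i-1} (\mathcal{Y}_j \setminus \hat{C}_j)$,
and the input of $f$ is the conditional kernel⁵.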

The proof of Theorem 2 is in the appendix. Theorem 2 states that
BwDPP-MAP is essentially a series of Bayesian belief updates, where in
each update a conditional kernel that contains the information of the
previous selection result is fed into $f$. The equivalent form allows
us to compare BwDPP-MAP directly with the method of applying $f$ on the
entire kernel. The latter does inference on the entire set
$\mathcal{Y}$ once, while the former does the inference on a sequence
of smaller subsets $\mathcal{Y}_1, \dots, \mathcal{Y}_m$. Concretely,
in the $i$-th update, a subset $\mathcal{Y}_i$ is added to obtain the
kernel $L_{\cup_{j=1}^{i} \mathcal{Y}_j}$. Then the information of the
previous selection result is incorporated into the kernel to generate
the conditional kernel. Finally, the DPP-MAP inference is performed on
the conditional kernel to select $C_i$ from $\mathcal{Y}_i$.

⁵ The conditional distribution (over the set
$\mathcal{Y} - A^{in} - A^{out}$) of the DPP defined by $L$,

$$\mathcal{P}_L(Y = A^{in} \cup B \mid A^{in} \subseteq Y,\; A^{out} \cap Y = \emptyset), \tag{12}$$

is also a DPP (Kulesza and Taskar 2012), and the corresponding kernel,
$(L \mid A^{in} \subseteq Y,\; A^{out} \cap Y = \emptyset)$, is called
the conditional kernel.

Figure 2: (a) The top-left 100 × 100 entries from a 500 × 500 synthetic
kernel. (b) The log-probability ratio $\log(p/p_{ref})$ and runtime
ratio $t/t_{ref}$, obtained from using BwDPP-MAP on the same kernel
with different γ-partitions, where $p_{ref}$ and $t_{ref}$ are the
baseline performance of directly applying the greedy MAP algorithm on
the original unpartitioned kernel. Results are averaged over 1000
kernels. The error bars represent the 99.7% confidence level.

Table 2: Greedy DPP-MAP Algorithm

Initialization: Set $C \leftarrow \emptyset$, $U \leftarrow \mathcal{Y}$;
While $U$ is not empty:
    $i^* \leftarrow \arg\max_{i \in U} L_{ii}$; $C \leftarrow C \cup \{i^*\}$;
    Compute $L^* = \left( \left[ (L + I_{\bar{i^*}})^{-1} \right]_{\bar{i^*}} \right)^{-1} - I$;
    $L \leftarrow L^*$;
Return: $C$.

BwDPP-based Change-Point Detection 

Let $x_1, \dots, x_T$ be the time-series observations, where
$x_t \in \mathbb{R}^D$ represents the $D$-dimensional observation at
time $t = 1, \dots, T$, and let $x_{\tau:t}$ denote the segment of
observations in the time interval $[\tau, t]$. We further use
$X_1, X_2$ to represent different segments of observations at
different intervals, when explicitly denoting the beginning and ending
times of the intervals is not necessary. The new CPD method builds on
existing metrics. A dissimilarity metric is denoted as
$d: (X_1, X_2) \mapsto \mathbb{R}$, which measures the dissimilarity
between two arbitrary segments $X_1$ and $X_2$.

Quality-Diversity Decomposition of Kernel 

Given a set of items $\mathcal{Y} = \{1, \dots, N\}$, the DPP kernel
$L$ can be written as a Gram matrix $L = B^T B$, where $B_i$, the
columns of $B$, are vectors representing items in $\mathcal{Y}$.

A popular decomposition of the kernel is to define $B_i = q_i \phi_i$,
where $q_i \in \mathbb{R}^+$ measures the quality (magnitude) of item
$i$ in $\mathcal{Y}$, and $\phi_i \in \mathbb{R}^k$, $\|\phi_i\| = 1$,
can be viewed as the angle vector of diversity features, so that
$\phi_i^T \phi_j$ measures the similarity between items $i$ and $j$.
Therefore, $L$ is defined as

$$L = \mathrm{diag}(q) \cdot S \cdot \mathrm{diag}(q), \tag{13}$$

where $q$ is the quality vector consisting of $q_i$, and $S$ is the
similarity matrix consisting of $S_{ij} = \phi_i^T \phi_j$. The
quality-diversity decomposition allows us to construct $q$ and $S$
separately to address different concerns, which is utilized below to
construct the kernel for CPD.

BwDppCpd 

BwDppCpd is a two-step CPD method, described as follows. 

Step 1: Based on a dissimilarity metric $d$, a preliminary set of
change-point candidates is created. Consider moving a pair of adjacent
windows, $x_{t-w+1:t}$ and $x_{t+1:t+w}$, along
$t = w, \dots, T - w$, where $w$ is the size of the local windows.
Then, a large $d$ value for the adjacent windows, i.e.
$d(x_{t-w+1:t}, x_{t+1:t+w})$, suggests that a change-point is likely
to occur at time $t$. After we obtain the series of $d$ values, local
peaks above the mean of the $d$ values are marked, and the
corresponding locations, say $t_1, \dots, t_N$, are selected to form
the preliminary set of change-point candidates
$\mathcal{Y} = \{1, \dots, N\}$.

Step 2: Treat the change-point candidates
$\mathcal{Y} = \{1, \dots, N\}$ as BwDPP items, and select a subset
from them to be our final estimate of the change-points.
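Step 1 can be sketched as a sliding-window scan followed by a simple peak picker. The dissimilarity used below (absolute difference of window means) is only a toy stand-in chosen for this example, not one of the metrics used in the paper, and the exact peak rule (strictly above the left neighbour, not below the right one) is likewise an assumption; the function names are hypothetical.

```python
import numpy as np


def mean_shift(a, b):
    """Toy dissimilarity (an assumption): gap between the two window means."""
    return float(abs(np.mean(a) - np.mean(b)))


def candidate_change_points(x, w, d):
    """Scan adjacent windows of size w and return local peaks above the mean.

    With 0-based slicing, x[t - w:t] and x[t:t + w] are the windows
    x_{t-w+1:t} and x_{t+1:t+w} of the text, for t = w, ..., T - w.
    """
    x = np.asarray(x)
    T = len(x)
    times = np.arange(w, T - w + 1)
    scores = np.array([d(x[t - w:t], x[t:t + w]) for t in times])
    threshold = scores.mean()
    peaks = [
        i for i in range(1, len(scores) - 1)
        if scores[i] > threshold
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    return times[peaks], scores


# Noise-free toy signal with a single jump after the 50th observation.
x = np.concatenate([np.zeros(50), np.ones(50)])
candidates, scores = candidate_change_points(x, w=10, d=mean_shift)
```

On real data this scan typically yields many more candidates than true change-points, which is why the subset selection of Step 2 is needed.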

The BwDPP kernel is built via quality-diversity decomposition. We use
the dissimilarity metric $d$ once more to measure the quality of a
candidate change-point, i.e. how likely it is to be a true one.
Specifically, we define

$$q_i = d(x_{t_{i-1}:t_i}, x_{t_i:t_{i+1}}). \tag{14}$$

The higher the value of $q_i$, the sharper the contrast around the
change-point candidate $i$, and the better the quality of $i$.

Next, the BwDPP similarity matrix is defined to address the fact that
the true locations of change-points along the timeline tend to be
diverse, since states would not change rapidly. This is done by
assigning a high similarity score to items that are close to each
other. Specifically, we define

$$S_{ij} = \exp\left(-(t_i - t_j)^2 / \sigma^2\right), \tag{15}$$

where $\sigma$ is a parameter representing the position diversity
level. Finally, after taking a γ-partition of the kernel $L$ into the
almost block diagonal form, BwDPP-MAP is used to select a set of
change-points that favours both quality and diversity (Fig. 3(b)).
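The kernel construction of Step 2 can be sketched as below. The Gaussian form of the similarity, exp(-(t_i - t_j)^2 / sigma^2), is an assumption of this sketch about the exact shape of the position similarity, and the candidate times, quality values and the function name `build_cpd_kernel` are made up for illustration.

```python
import numpy as np


def build_cpd_kernel(t, q, sigma):
    """Return L = diag(q) S diag(q) and S for candidate times t and qualities q."""
    t = np.asarray(t, dtype=float)
    q = np.asarray(q, dtype=float)
    # Assumed Gaussian position similarity: close candidates are similar.
    S = np.exp(-((t[:, None] - t[None, :]) ** 2) / sigma ** 2)
    L = q[:, None] * S * q[None, :]
    return L, S


# Hypothetical candidates: two tight clusters and one isolated point.
t = [10, 12, 40, 43, 90]
q = [2.0, 1.0, 3.0, 0.5, 1.5]
L, S = build_cpd_kernel(t, q, sigma=5.0)
```

Entries between distant candidates decay towards zero, so most of the mass of L concentrates in blocks along the diagonal, which is the structure that the γ-partition exploits.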

Discussion 

There is a rich body of work on metrics for the CPD problem. The
choice of the dissimilarity metric $d(X_1, X_2)$ is flexible and can be
tailored to the characteristics of the data. We present two examples
that are used in our experiments.

• Symmetric Kullback-Leibler Divergence (SymKL):

If the two segments $X_1, X_2$ to be compared are assumed to follow
Gaussian processes, the SymKL metric is given by

$$
\mathrm{SymKL}(X_1, X_2) = \mathrm{tr}(\Sigma_1 \Sigma_2^{-1}) + \mathrm{tr}(\Sigma_2 \Sigma_1^{-1}) - 2D
+ \mathrm{tr}\left( (\Sigma_1^{-1} + \Sigma_2^{-1}) (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \right), \tag{16}
$$

where $\mu$ and $\Sigma$ are the corresponding sample mean and
covariance.



Figure 3: A BwDppCpd example from Hasc. (a) Change-point candidates
selected in Step 1 with their $d$ scores (green cross). (b) Final
estimate of change-points in Step 2 with their $d$ scores (green
cross).

Figure 4: BwDppCpd results for Well-Log (a), Coal Mine Disaster (b),
and DJIA (c). Green lines are detected changes.


• Generalized Likelihood Ratio (GLR):

Generally, the GLR metric is given by the likelihood ratio:

$$
\mathrm{GLR}(X_1, X_2) = \frac{\mathcal{L}(X_1 | \lambda_1)\, \mathcal{L}(X_2 | \lambda_2)}{\mathcal{L}(X_{1,2} | \lambda_{1,2})}. \tag{17}
$$

The numerator is the likelihood that the two segments follow two
different models $\lambda_1$ and $\lambda_2$ respectively, while the
denominator is the likelihood that the two segments together (denoted
as $X_{1,2}$) follow a single model $\lambda_{1,2}$. In practice, we
plug in the maximum likelihood estimates (MLE) for the parameters
$\lambda_1$, $\lambda_2$, and $\lambda_{1,2}$. For example, if we
assume that the time-series segment $X = \{x_1, \dots, x_M\}$ follows a
homogeneous Poisson process, where $x_i$ is the occurrence time of the
$i$-th event, $i = 1, \dots, M$, the log-likelihood of $X$ is

$$\mathcal{L}(X | \lambda) = (M - 1) \log \lambda - (x_M - x_1) \lambda, \tag{18}$$

where the MLE of $\lambda$ is used, $\lambda = (M - 1)/(x_M - x_1)$.
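Both example metrics are short to compute. The sketch below evaluates the SymKL expression from sample means and covariances, and the Poisson log-likelihood of (18) with the MLE rate plugged in. The function names and the synthetic test data are assumptions for illustration; the covariance estimate uses NumPy's default (unbiased) normalisation, which the text does not specify.

```python
import numpy as np


def sym_kl(X1, X2):
    """SymKL between two segments, each an array of shape (n_samples, D)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.atleast_2d(np.cov(X1, rowvar=False))
    S2 = np.atleast_2d(np.cov(X2, rowvar=False))
    D = S1.shape[0]
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    diff = (mu1 - mu2)[:, None]                 # column vector mu_1 - mu_2
    return float(
        np.trace(S1 @ S2_inv) + np.trace(S2 @ S1_inv) - 2 * D
        + np.trace((S1_inv + S2_inv) @ diff @ diff.T)
    )


def poisson_loglik(x):
    """Log-likelihood (18) of sorted event times x under the MLE rate."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    span = x[-1] - x[0]
    lam = (M - 1) / span                        # MLE of the rate
    return float((M - 1) * np.log(lam) - span * lam)


rng = np.random.default_rng(0)
A = rng.normal(loc=0.0, size=(200, 3))          # segment around mean 0
B = rng.normal(loc=2.0, size=(200, 3))          # segment around mean 2
same = sym_kl(A, A)
apart = sym_kl(A, B)
ll = poisson_loglik([0.0, 1.0, 2.0, 3.0])       # rate 1 over a span of 3
```

A segment compared with itself gives a SymKL of zero, and the value grows as the two segments' means or covariances move apart.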


Experiments 

The BwDppCpd method is evaluated on five real-world time-series
datasets. Firstly, three classic datasets are examined for CPD, namely
Well-Log data, Coal Mine Disaster data, and Dow Jones Industrial
Average Return (DJIA) data, where we set γ = 0 due to the small data
size.

Next, we experiment with human activity detection and speech
segmentation, where the data size becomes larger and there is no
accurate model to characterize the data, making the CPD task harder. In
both experiments, the number of DPP items varies from hundreds to
thousands, where, except for BwDPP-MAP, no other algorithm can perform
MAP inference within a reasonable time due to the large kernel scale.
We set γ = 3 for human activity detection and γ = 0, 2 for speech
segmentation to provide a comparison.

As for the dissimilarity metric $d$, Poisson processes and GLR are used
for Coal Mine Disaster; for the other experiments, Gaussian models and
SymKL are used.

            PRC%     RCL%     F1
BwDppCpd    93.05    87.88    0.9039
RuLSIF      86.36    83.84    0.8508

Table 3: CPD result on human activity detection data HASC.

Figure 5: The ROC curve of BwDppCpd and RuLSIF.


Well-Log Data 

Well-Log contains 4050 measurements of nuclear magnetic response taken
during the drilling of a well. It is an example of varying Gaussian
mean, and the changes reflect the stratification of the earth's crust
(Adams and MacKay 2007). Outliers are removed prior to the experiment.
As shown in Fig. 4(a), all changes are detected by BwDppCpd.


Coal Mine Disaster Data 

Coal Mine Disaster (Jarrett 1979), a standard dataset for testing CPD
methods, consists of 191 accidents from 1851 to 1962. The occurrence
rate of accidents is believed to have changed a few times, and the task
is to detect these changes. The BwDppCpd detection result, as shown in
Fig. 4(b), agrees with that in (Green 1995).


1972-75 Dow Jones Industrial Average Return 

DJIA contains daily return rates of the Dow Jones Industrial Average
from 1972 to 1975. It is an example of varying Gaussian variance, where
the changes are caused by big events that have potential macroeconomic
effects. Four changes in the data are detected by BwDppCpd, which match
well with important events (Fig. 4(c)). Compared to (Adams and MacKay
2007), one more change is detected (the rightmost), which corresponds
to the date on which the 1973-74 stock market crash ended⁶. This shows
that BwDppCpd discovers more information from the data.


Human Activity Detection 

Hasc⁷ contains human activity data collected by portable three-axis
accelerometers, and the task is to segment the data according to human
behaviour changes. Fig. 3(b) shows an example of Hasc. The performance
of the best algorithm in (Liu et al. 2013), RuLSIF, is used for
comparison, and the precision (PRC), recall (RCL), and $F_1$ measure
(Kotti, Moschou, and Kotropoulos 2008) are used for evaluation:

$$\mathrm{PRC} = \mathrm{CFC}/\mathrm{DET}, \quad \mathrm{RCL} = \mathrm{CFC}/\mathrm{GT}, \tag{19}$$

$$F_1 = 2\,\mathrm{PRC}\,\mathrm{RCL}/(\mathrm{PRC} + \mathrm{RCL}), \tag{20}$$

where CFC is the number of correctly found changes, DET is the number
of detected changes, and GT is the number of ground-truth changes. The
$F_1$ score can be viewed as an overall score that balances PRC and
RCL. The CPD result is shown in Table 3, where the parameters are set
to attain the best $F_1$ results for both algorithms.

⁶ http://en.wikipedia.org/wiki/1973-74_stock_market_crash
⁷ http://hasc.jp/hc2011/

The receiver operating characteristic (ROC) curve is often used to
evaluate performance under different precision and recall, where the
true positive rate (TPR) and false positive rate (FPR) are given by
TPR = RCL and FPR = 1 - PRC. For BwDppCpd, different levels of TPR and
FPR are obtained by tuning the position diversity parameter $\sigma$,
and for RuLSIF by tuning the threshold $\eta$ (Liu et al. 2013).

As shown in Table 3 and Fig. 5, BwDppCpd outperforms RuLSIF on HASC
when the FPR is low. RuLSIF has a better performance only when FPR
exceeds 0.3, which is less useful.
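The evaluation measures in (19) and (20) are simple ratios. The sketch below computes them and recomputes the $F_1$ values of Table 3 from the reported precision and recall; the function names and the counts in the second example are hypothetical.

```python
def f1_score(prc, rcl):
    """F1 as defined in (20), from precision and recall given as fractions."""
    return 2 * prc * rcl / (prc + rcl)


def prc_rcl_f1(cfc, det, gt):
    """Precision, recall and F1 from counts, following (19) and (20).

    cfc: correctly found changes, det: detected changes, gt: ground truth.
    """
    prc = cfc / det
    rcl = cfc / gt
    return prc, rcl, f1_score(prc, rcl)


# Recompute F1 from the precision and recall reported in Table 3.
f1_bwdppcpd = f1_score(0.9305, 0.8788)
f1_rulsif = f1_score(0.8636, 0.8384)

# Hypothetical counts: 8 correct out of 10 detections, 16 true changes.
prc, rcl, f1 = prc_rcl_f1(cfc=8, det=10, gt=16)
```

Rounded to four decimals, the recomputed values agree with the $F_1$ column of Table 3 (0.9039 and 0.8508).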


Speech Segmentation 

We tested two datasets for speech segmentation. The first dataset,
called Hub4m97, is a subset (around 5 hours) of the 1997 Mandarin
Broadcast News Speech (HUB4-NE) released by LDC⁸. The second dataset,
called TelRecord, consists of 216 telephone conversations, each around
2 minutes long, collected from real-world call centres. Acoustic
features of 12-order MFCCs (mel-frequency cepstral coefficients) are
extracted as the time-series data.

Speech segmentation aims to segment the audio data into acoustically
homogeneous segments, e.g. utterances from a single speaker or
non-speech portions. The two datasets contain utterances with
hesitations and a variety of changing background noises, presenting a
great challenge for CPD.

The BwDppCpd method with different γ for kernel partition (denoted as
Bw-γ in Table 4) is tested, and two classic segmentation methods, BIC
(Chen and Gopalakrishnan 1998) and DISTBIC (Delacourt and Wellekens
2000), are used for comparison. As in (Delacourt and Wellekens 2000), a
post-processing step based on BIC values is also taken to reduce the
false alarms for BwDppCpd.

The experimental results in Table 4 show that BwDppCpd outperforms BIC
and DISTBIC on both datasets. In addition, comparing the results
obtained with γ = 0 and γ = 2, using γ = 2 is found to be faster but
has a slightly worse performance. This agrees with our analysis of
BwDPP-MAP, in which different γ-partitions trade off speed and
accuracy.

⁸ http://catalog.ldc.upenn.edu/LDC98S73

                    BIC       DistBIC   Bw-0      Bw-2
Hub4m97    PRC%     59.40     64.29     65.29     65.12
           RCL%     78.24     74.98     78.49     78.39
           F1       0.6753    0.6922    0.7128    0.7114
TelRecord  PRC%     54.05     61.39     66.54     66.47
           RCL%     79.97     81.72     85.47     84.83
           F1       0.6451    0.7011    0.7483    0.7454

Table 4: Segmentation results on Hub4m97 and TelRecord.

Conclusion

In this paper, we introduced BwDPPs, a class of DPPs where the kernel
is almost block diagonal and thus allows efficient block-wise MAP
inference. Moreover, BwDPPs are demonstrated to be useful in the
change-point detection problem. The BwDPP-based change-point detection
method, BwDppCpd, shows superior performance in experiments with
several real-world datasets.

The almost block diagonal kernels suit the change-point detection
problem well, but BwDPPs may achieve more than that. Theoretically,
BwDPP-MAP could be applied to any block tridiagonal matrix without
modification. The theoretical issues regarding exact or approximate
partition of a DPP kernel into the form of an almost block diagonal
matrix (Acer, Kayaaslan, and Aykanat 2013) remain to be studied. Other
potential BwDPP applications are also worth further exploration.


Appendix: Proof of Lemma 1


Proof Define 


S* = 


-‘Vi- 


-■Ti+i [^yi+i,yi+2 0 ] 

0]^ L, 

^yi- 


i = 0 

z = 1, ■ 


+i,yi+2 ”J 

i = m — 1 

. ( 21 ) 

• • , m — 1, S* is the Schur complement of Lct in 
,, s, the sub-matrix of We next prove the 

1-^j ) 

lemma using the first principle of mathematical induction. 
State the predicate as: 


For i = 1 

Q? —1 

®CiU(U- 


P{i): S* ^ and are positive semi-definite (PSD). 


Proof The proof is given by mathematical induction. 
When n = 1, the result trivially holds: 

(23) 

(24) 


Assume the result holds for n = i — 1, i.e.. 
Consider the case when n = i. One has 








-1 

-1 
Ci ■ 


Therefore the result holds for $i = 1, \dots, m$.

To prove Theorem 2, it suffices to show that


Ly, = (L, 

Using 


( 


c Y,Ci.,_iny = 

one has 


3). 


\Ci:-i C y, n y = 0 ) 




Jy 


-1 


-I 




= Lv, - U 




- 2 


-■y ^Ci-i,y 

Following Lemmaj^to complete the proof 

$$\mathrm{RHS} = L_{\mathcal{Y}_i} - L_{C_{i-1}, \mathcal{Y}_i}^T \tilde{L}_{C_{i-1}}^{-1} L_{C_{i-1}, \mathcal{Y}_i} = \tilde{L}_{\mathcal{Y}_i}.$$